Introduction

I selected the Apple Music top 100 playlists from different countries because I was curious to explore the cultural variations in music preferences across regions. By comparing these playlists, I aimed to identify common trends in popular songs as well as uncover unique regional music tastes. I wanted to understand if there were any songs that consistently appeared across multiple playlists, indicating their widespread popularity regardless of geographical boundaries. Additionally, I was interested in observing any distinct patterns or genres that dominated specific countries’ playlists, which could provide insights into the local music scenes and cultural influences.

Suppose I want to find the songs that my playlist, “Good Vibes”, shares with my friend’s playlist, “New Hits”. Using SQL, I can perform an INNER JOIN on the song titles from both playlists, with each playlist stored as a table that has a song_title column. The query would look like this:

SELECT good_vibes.song_title
FROM good_vibes
INNER JOIN new_hits ON good_vibes.song_title = new_hits.song_title;

This query would retrieve the song titles that appear in both playlists. SQL queries allow for easy comparison of playlists, enabling me to discover shared music interests and initiate engaging conversations about our favorite tunes.
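As a rough sketch, the same comparison could be done in R with dplyr, which is the package used for the joins in the appendix. The two playlists and their song titles below are made up for illustration.

```r
library(dplyr)

# Hypothetical playlists; the song titles are invented for this example.
good_vibes <- tibble(song_title = c("Song A", "Song B", "Song C"))
new_hits   <- tibble(song_title = c("Song B", "Song C", "Song D"))

# inner_join() keeps only the rows whose song_title appears in both tables,
# which mirrors the SQL INNER JOIN.
common_songs <- inner_join(good_vibes, new_hits, by = "song_title")

print(common_songs$song_title)
```

Only the titles present in both playlists are kept, in the order they appear in the first playlist.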

Data Creation

Plot: Mean Song Length and Number of Songs by Year

The combined plot tells a story about how the songs in these playlists are spread over time. The line plot shows the mean song length for each album release year, so we can see whether the charting songs have become shorter or longer on average. The bar plot shows how many of the playlist songs were released in each year, giving an overview of which release years dominate the charts. Stacking the two panels lets us explore the relationship between song length and song counts in one static figure instead of two.
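The two summaries behind the plot can be sketched on a few toy rows. This is only an illustration: the data below is invented, and it assumes song lengths are stored as "minutes:seconds" strings, which are converted to seconds with lubridate before averaging.

```r
library(dplyr)
library(lubridate)

# Hypothetical rows; the real data comes from the scraped playlists.
toy_songs <- tibble(
  year = c(2022, 2022, 2023, 2023, 2023),
  song_length = c("3:00", "4:00", "2:30", "3:30", "3:00")
)

yearly_summary <- toy_songs %>%
  # ms() parses "minutes:seconds"; period_to_seconds() turns it into a number
  mutate(length_seconds = period_to_seconds(ms(song_length))) %>%
  group_by(year) %>%
  summarise(
    mean_length_seconds = mean(length_seconds),
    total_songs = n()
  )

print(yearly_summary)
```

Each row of the result feeds one point on the line plot (mean length) and one bar on the bar plot (song count).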

The visualization sparks questions and invites further analysis. Are there any significant deviations from the mean song length trend in specific years? How do different genres contribute to these patterns? By encouraging exploration and interpretation, the visualization opens up avenues for deeper investigations into the intricate dynamics of music trends and cultural shifts.

Top Songs

Top Songs GIF

Bottom Songs

Bottom Songs GIF

In this data visualization project, I used R to create an engaging visual representation of the top and bottom songs from different countries. Using R packages such as tidyverse, rvest, magick, and dplyr, I demonstrated the versatility of R for data manipulation, web scraping, image processing, and visualization.

Using web scraping techniques, I collected data on the top three and bottom three songs from each country. By extracting album cover images from Apple Music and placing them on a coloured background with magick functions, I created images ready for further processing. Combining the modified images with annotations such as country name, song ranking, song name, and song length, I constructed an animated GIF that communicates the diversity of music trends across countries.
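Picking the top and bottom songs per country can be sketched as follows. The chart below is a made-up five-song example with two entries taken per end, rather than the real Top 100 data with three.

```r
library(dplyr)

# Hypothetical chart: two countries with five ranked songs each.
toy_chart <- tibble(
  country = rep(c("Ireland", "USA"), each = 5),
  ranking = rep(1:5, times = 2),
  song_name = paste("Song", 1:10)
)

# Sort by country and rank, then take the first rows of each group.
top_two <- toy_chart %>%
  arrange(country, ranking) %>%
  group_by(country) %>%
  slice_head(n = 2) %>%
  ungroup()

# The same sort, but taking the last rows of each group.
bottom_two <- toy_chart %>%
  arrange(country, ranking) %>%
  group_by(country) %>%
  slice_tail(n = 2) %>%
  ungroup()

print(top_two)
print(bottom_two)
```

Because the rows are sorted by ranking inside each country, slice_head() returns the best-ranked songs and slice_tail() the worst-ranked ones.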

This project showcases my proficiency in using R for statistical analysis and programming, as well as my ability to craft compelling visual narratives. By incorporating data manipulation, web scraping, image processing, and visualization techniques, I presented insights into the top and bottom songs from various countries in a captivating and informative manner.

Final reflection

Throughout Module 5: Creating data from digital sources, I learned the importance of leveraging various digital sources to collect and create meaningful datasets. One important idea that stood out to me was the power of web scraping in extracting data from websites and online platforms. This technique allowed me to programmatically collect data that would have been otherwise time-consuming or challenging to obtain manually. By utilizing tools like rvest in R, I gained the ability to scrape and parse data from HTML pages, opening up a wide range of possibilities for data collection and analysis.
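To show the idea without depending on a live website, here is a minimal sketch that parses a small HTML fragment with rvest. The fragment and its class names are invented for the example and are not Apple Music's real markup.

```r
library(rvest)

# A made-up HTML fragment standing in for a downloaded page.
# The class names "track", "title" and "length" are hypothetical.
page <- minimal_html('
  <ul>
    <li class="track"><span class="title">Song A</span><span class="length">3:05</span></li>
    <li class="track"><span class="title">Song B</span><span class="length">2:48</span></li>
  </ul>
')

# Select elements with CSS selectors, then extract their text.
titles <- page %>%
  html_elements(".track .title") %>%
  html_text2()

lengths <- page %>%
  html_elements(".track .length") %>%
  html_text2()

print(data.frame(title = titles, length = lengths))
```

With a real page, minimal_html() would be replaced by read_html(url), and the selectors would need to match whatever classes the site actually uses at the time of scraping.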

Reflecting on my learning across all the projects in the course, I am amazed at how much I have grown in my data analysis skills and technical abilities. I learned valuable techniques for data cleaning, manipulation, and visualization using R and various packages like dplyr, ggplot2, and tidyr. The projects provided hands-on experience in working with real-world datasets, allowing me to apply statistical techniques and explore patterns and insights.

As I move forward, I am curious to learn more about advanced topics such as machine learning and predictive modeling. I am intrigued by the potential of using algorithms and models to uncover hidden patterns and make predictions based on data. Additionally, I would like to delve deeper into data storytelling and effective communication of data insights. Understanding how to convey complex information in a clear and compelling manner will be valuable in driving impactful decision-making.

Overall, the course has been an enriching journey that has equipped me with practical data analysis skills, a solid foundation in R coding, and a thirst for continuous learning in the field of data science.

Appendix

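# Script 1: scrape song data from the Top 100 playlists
library(tidyverse)
library(rvest)
library(lubridate)  # for now(); loaded explicitly in case tidyverse does not attach it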

#5 countries top 100 url vector
url_vector = c(url_ie = "https://music.apple.com/us/playlist/top-100-ireland/pl.3b47111ed6b7461eae67fadf895d56db",
               url_usa = "https://music.apple.com/us/playlist/top-100-usa/pl.606afcbb70264d2eb2b51d8dbcfa6a12",
               url_nz = "https://music.apple.com/us/playlist/top-100-new-zealand/pl.d8742df90f43402ba5e708eefd6d949a",
               url_sa = "https://music.apple.com/us/playlist/top-100-south-africa/pl.447bd05172824b89bd745628f7f54c18",
               url_sg = "https://music.apple.com/us/playlist/top-100-singapore/pl.4d763fa1cf15433b9994a14be6a46164")


# Build the song data frame: one row per song, scraped from each playlist page.
song_data = map_df(seq_along(url_vector), function(i){
  Sys.sleep(2)  # pause between requests to be polite to the server
  
  page = read_html(url_vector[[i]])

  track_id <- page %>%
    html_elements(".songs-list") %>%
    html_elements("a") %>%
    html_attr("href") %>%
    .[str_detect(., "/song/")] %>%
    str_remove_all("https://(.*)/song/(.*)/")
  
  album_id <- page %>%
    html_elements(".songs-list__col.songs-list__col--tertiary") %>%
    html_elements(".songs-list__song-link-wrapper") %>%
    html_elements("a") %>%
    html_attr("href") %>%
    str_remove_all("https://(.*)/album/(.*)/")
  
  ranking <- page %>%
    html_elements(".songs-list-row__rank") %>%
    html_text2() %>%
    parse_number()
  
  country <- page %>%
    html_element(".headings__title") %>%
    html_text2() %>%
    str_remove_all("Top 100: ")
  
  song_name <- page %>%
    html_elements(".songs-list-row__song-name") %>%
    html_text2()
  
  song_length <- page %>%
    html_elements(".songs-list-row__controls") %>%
    html_text2() %>%
    str_remove_all("PREVIEW") %>%
    str_remove_all("\n")
  
  
  return(tibble(track_id, song_length, song_name, ranking, album_id, country, date_scraped = now()))
  
})

# saving the data frame.
saveRDS(song_data, "song_data.rds")  

write.csv(song_data, "song_data.csv", row.names=FALSE)

# Script 2: scrape album details for each song
library(tidyverse)
library(rvest)
library(lubridate)  # for mdy()

# opening the previous data frame.
song_data <- readRDS("song_data.rds")


# scraping the album data

album_ids <- song_data$album_id %>% 
  unique()

# map_df() passes each album id straight to the function,
# so the id is used directly instead of being looked up by index.
album_data <- map_df(album_ids, function(album_id){
  Sys.sleep(2)
  
  url <- paste0("https://music.apple.com/us/album/", album_id)
  
  page <- read_html(url)
  
  album_info <- page %>%
    html_elements(".footer-body") %>%
    html_elements(".description") %>%
    html_text() %>%
    str_split("\n") %>%
    unlist()
  
  album_release_date <- album_info[1] %>%
    mdy()
  
  genre <- page %>%
    html_elements(".headings__metadata-bottom") %>%
    html_text2() %>%
    split("  ") %>%
    str_remove_all(" · 2023")
  
  
  artist_name <- page %>%
    html_elements(".svelte-d0m3dm .svelte-1nh012k") %>%
    html_text2() 
  
  
  return(tibble(album_id,album_release_date,artist_name,genre))
  
})

# post processing
album_data <- album_data %>%
  filter(!artist_name=="Preview") %>%
  mutate(genre = sub(" ·.*", "", genre))

# Save the album data under its own name so it is not overwritten by the joined data below.
saveRDS(album_data, "album_data.rds")


# joining both data frames and save it.
playlist_data <- inner_join(song_data, album_data, by = c("album_id" = "album_id"))

saveRDS(playlist_data, "playlist_data.rds")

# Script 3: plot mean song length and song counts by year
library(tidyverse)
library(ggplot2)
library(lubridate)
library(scales)
library(cowplot)

# Manipulate the data
playlist_data <- readRDS("playlist_data.rds")

playlist_data <- playlist_data %>%
  mutate(
    album_release_date = as.Date(album_release_date),
    # Convert "m:ss" strings to total seconds (e.g. "3:45" becomes 225);
    # replacing ":" with "." would give 3.45, which is not a duration in seconds.
    song_length_seconds = period_to_seconds(ms(song_length)),
    year = year(album_release_date)
  )

# Mean song length by year
mean_length_by_year <- playlist_data %>%
  group_by(year) %>%
  summarize(mean_length_seconds = mean(song_length_seconds, na.rm = TRUE))

# Plot for mean song length by year
mean_length_plot <- ggplot(mean_length_by_year, aes(x = year, y = mean_length_seconds)) +
  geom_line(color = "#e41a1c", size = 1.5) +
  geom_point(color = "#1f77b4", size = 3, fill = "white", shape = 21) +
  labs(
    title = "Mean Song Length by Year",
    x = "Year",
    y = "Mean Song Length (Seconds)",
    caption = "Source: music.apple.com"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 20, face = "bold"),
    plot.caption = element_text(size = 10, color = "gray"),
    axis.title = element_text(size = 14),
    axis.text = element_text(size = 12),
    axis.line = element_line(color = "black"),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "#f7f7f7"),  # Change the background color here
    axis.title.y = element_text(size = 8)  # Reduce the y-axis label size
  )

# Total count of songs by year
song_count_by_year <- playlist_data %>%
  group_by(year) %>%
  summarize(total_count = n())

# Plot for released songs by year
count_plot <- ggplot(song_count_by_year, aes(x = year, y = total_count)) +
  geom_bar(stat = "identity", fill = "#ff7f0e", alpha = 0.8) +
  labs(
    title = "Total Songs Released by Year",
    x = "Year",
    y = NULL,
    caption = "Source: music.apple.com"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 20, face = "bold"),
    plot.caption = element_text(size = 10, color = "gray"),
    axis.title = element_text(size = 14),
    axis.text = element_text(size = 12),
    axis.line = element_line(color = "black"),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    axis.ticks.length = unit(0.2, "cm")
  )

# Combine the plots and saving.
combined_plot <- plot_grid(
  mean_length_plot, count_plot, ncol = 1, align = "v",
  rel_heights = c(1.5, 1.25)
)

plot_layout <- theme(
  plot.margin = margin(40, 50, 20, 50),
  plot.background = element_rect(fill = "#f7f7f7"),  
  panel.background = element_rect(fill = "white"),
  panel.border = element_rect(color = "black", fill = NA, size = 1),
  plot.caption = element_text(hjust = 0, margin = margin(t = 10)),
  axis.line.x = element_line(color = "black"),
  axis.line.y = element_line(color = "black")
)

final_plot <- combined_plot + plot_layout

ggsave("song_vis.png", final_plot, width = 8, height = 5, units = "in", dpi = 300)

# Script 4: build the top and bottom songs GIFs
library(tidyverse)
library(rvest)
library(magick)
library(httr)
library(dplyr)

# Load the scraped song data saved by the first script
song_data <- readRDS("song_data.rds")

# Top 3 songs by rank and country
top_songs <- song_data %>%
  arrange(country, ranking) %>%
  group_by(country) %>%
  slice_head(n = 3)

# Bottom 3 songs by rank and country
bottom_songs <- song_data %>%
  arrange(country, ranking) %>%
  group_by(country) %>%
  slice_tail(n = 3)

top_ids <- top_songs$album_id
len_top <- length(top_ids)


# Scrape the album cover image URL for each top song, then for each bottom song.
tops_data <- map_df(top_ids, function(top_id) {
  Sys.sleep(2)
  
  url <- paste0("https://music.apple.com/us/album/", top_id)
  
  page <- read_html(url)
  
  picture <- page %>%
    html_node("meta[property='og:image']") %>%
    html_attr("content")
  
  return(tibble(top_id, picture))
})

bottom_ids <- bottom_songs$album_id

bottom_data <- map_df(bottom_ids, function(bottom_id) {
  Sys.sleep(2)
  
  url <- paste0("https://music.apple.com/us/album/", bottom_id)
  
  page <- read_html(url)
  
  picture <- page %>%
    html_node("meta[property='og:image']") %>%
    html_attr("content")
  
  return(tibble(bottom_id, picture))
})

# Drop duplicate albums so the joins do not duplicate songs.
tops_data <- tops_data %>% distinct()
bottom_data <- bottom_data %>% distinct()

# Joining data frames
top_song_df <- left_join(top_songs, tops_data, by = c("album_id" = "top_id"))

bottom_song_df <- left_join(bottom_songs, bottom_data, by = c("album_id" = "bottom_id"))

saveRDS(top_song_df, "top_song_df.rds")
saveRDS(bottom_song_df, "bottom_song_df.rds")


top_df <- readRDS("top_song_df.rds")
bottom_df <- readRDS("bottom_song_df.rds")

write.csv(top_df, "top_df.csv", row.names=FALSE)

bottom_df %>% glimpse()
write.csv(bottom_df, "bottom_df.csv", row.names=FALSE)



# Top songs GIF creation

temp_dir <- "C:/Users/64274/Documents/UoA/2023/Sem 1/STATS 220/Project 5/temp" 
output_dir <- "C:/Users/64274/Documents/UoA/2023/Sem 1/STATS 220/Project 5/pro_images"
if (!dir.exists(temp_dir)) {
  dir.create(temp_dir, recursive = TRUE)  
}

if (!dir.exists(output_dir)) {
  dir.create(output_dir)  
}

for (i in 1:nrow(top_df)) {
  response <- GET(top_df$picture[i])
  content_type <- headers(response)$`content-type`
  ext <- sub(".*?/", "", content_type)  
  file_path <- file.path(temp_dir, paste0("image_", i, ".", ext))
  writeBin(content(response, "raw"), file_path)
  
  image <- image_read(file_path)
  image_width <- image_info(image)$width
  image_height <- image_info(image)$height
  
  rect_height <- image_height + 400  
  rect <- image_blank(image_width, rect_height, "#FFDAB9")
  
  # The background is as wide as the cover and 400 px taller,
  # so the cover sits flush left and is centred vertically.
  offset_x <- 0
  offset_y <- (rect_height - image_height) / 2
  
  image_with_rect <- image_composite(rect, image, "over", offset = paste0("+", offset_x, "+", offset_y))
  
  country_name <- top_df$country[i]
  image_with_rect <- image_annotate(
    image_with_rect,
    country_name,
    size = 60,
    color = "black",
    gravity = "north",
    location = "+0+30"
  )
  
  rank_number <- as.character(top_df$ranking[i])
  image_with_rect <- image_annotate(
    image_with_rect,
    rank_number,
    size = 80,
    color = "black",
    gravity = "north",
    location = "+0+110"
  )
  
  song_name <- top_df$song_name[i]
  image_with_rect <- image_annotate(
    image_with_rect,
    song_name,
    size = 40,
    color = "black",
    gravity = "south",
    location = "+0+120"
  )
  
  song_length <- top_df$song_length[i]
  image_with_rect <- image_annotate(
    image_with_rect,
    song_length,
    size = 50,
    color = "black",
    gravity = "south",
    location = "+0+40"
  )
  
  # Zero-pad the index so list.files() returns the frames in numeric order.
  output_file_path <- file.path(output_dir, sprintf("image_%02d.%s", i, ext))
  image_write(image_with_rect, output_file_path)
}

# Create the GIF
image_files <- list.files(output_dir, full.names = TRUE)
images <- image_read(image_files)
gif_path <- "C:/Users/64274/Documents/UoA/2023/Sem 1/STATS 220/Project 5/Tops.gif"  
image_animate(images, delay = 100) %>% image_write(gif_path)



# Bottom songs GIF creation

temp_dir <- "C:/Users/64274/Documents/UoA/2023/Sem 1/STATS 220/Project 5/temp"  
output_dir <- "C:/Users/64274/Documents/UoA/2023/Sem 1/STATS 220/Project 5/bottom_files"  

if (!dir.exists(temp_dir)) {
  dir.create(temp_dir, recursive = TRUE) 
}

if (!dir.exists(output_dir)) {
  dir.create(output_dir)  
}

Bottom_list <- list()  

for (i in 1:nrow(bottom_df)) {
  response <- GET(bottom_df$picture[i])
  content_type <- headers(response)$`content-type`
  ext <- sub(".*?/", "", content_type)  
  file_path <- file.path(temp_dir, paste0("image_", i, ".", ext))
  writeBin(content(response, "raw"), file_path)
  
  image <- image_read(file_path)
  image_width <- image_info(image)$width
  image_height <- image_info(image)$height
  
  rect_height <- image_height + 400  
  rect <- image_blank(image_width, rect_height, "#FFFFE0")
  
  offset_x <- 0
  offset_y <- 200
  
  image_with_rect <- image_composite(rect, image, "over", offset = paste0("+", offset_x, "+", offset_y))
  
  country_name <- bottom_df$country[i]
  image_with_rect <- image_annotate(
    image_with_rect,
    country_name,
    size = 60,
    color = "blue",
    gravity = "north",
    location = "+0+30"
  )
  
  rank_number <- as.character(bottom_df$ranking[i])
  image_with_rect <- image_annotate(
    image_with_rect,
    rank_number,
    size = 80,
    color = "blue",
    gravity = "north",
    location = "+0+110"
  )
  
  song_name <- bottom_df$song_name[i]
  image_with_rect <- image_annotate(
    image_with_rect,
    song_name,
    size = 40,
    color = "blue",
    gravity = "south",
    location = "+0+120"
  )
  
  song_length <- bottom_df$song_length[i]
  image_with_rect <- image_annotate(
    image_with_rect,
    song_length,
    size = 50,
    color = "blue",
    gravity = "south",
    location = "+0+40"
  )
  
  # Zero-pad the index so list.files() returns the frames in numeric order.
  output_file_path <- file.path(output_dir, sprintf("image_%02d.%s", i, ext))
  image_write(image_with_rect, output_file_path)
}

# Create the GIF
image_files <- list.files(output_dir, full.names = TRUE)
images <- image_read(image_files)
gif_path <- "C:/Users/64274/Documents/UoA/2023/Sem 1/STATS 220/Project 5/Bottoms.gif"  
image_animate(images, delay = 100) %>% image_write(gif_path)